Comprehensive knowledge of protein-ligand interactions should provide a useful basis for annotating protein functions, studying 
protein evolution, engineering enzymatic activity, and designing drugs. To investigate the diversity and universality of ligand 
binding sites in protein structures, we conducted an all-against-all atomic-level structural comparison of over 180,000 ligand binding 
sites found in all the known structures in the Protein Data Bank by using a recently developed database search and alignment 
algorithm. By applying a hybrid top-down-bottom-up clustering analysis to the comparison results, we determined approximately 
3000 well-defined structural motifs of ligand binding sites. Apart from a handful of exceptions, most structural motifs were found 
to be confined within single families or superfamilies, and to be associated with particular ligands. Furthermore, we analyzed the 
components of the similarity network and enumerated more than 4000 pairs of ligand binding sites that were shared across different 
protein folds. 
Introduction

Most proteins function by interacting with other molecules.
Therefore, knowledge of the interactions between proteins and
their ligands is central to our understanding of protein
functions. However, simply enumerating the interactions of
individual proteins with individual ligands, which is now indeed
possible owing to the massive production of experimentally
determined protein structures, would only serve to increase the
amount of data, not necessarily our knowledge or understanding
of protein functions. What is needed is a classification of general
patterns of interactions. Otherwise, it would be difficult
to apply the wealth of information to elucidate the evolutionary
history of protein functions (Andreeva & Murzin, 2006;
Goldstein, 2008), to engineer enzymatic activity
(Gutteridge & Thornton, 2005), or to develop new drugs
(Rognan, 2007).

In order to classify protein-ligand interactions and to extract
general patterns from the classification, it is a prerequisite
to compare the ligand binding sites of different proteins.
There are already a number of methods to compare
the atomic structures or other structural features of functional
sites of proteins (see reviews: Jones & Thornton,
2004; Lee et al., 2007).

Applications of these methods have led to the discovery
of ligand binding site structures shared by many proteins
of different folds (Kobayashi & Go, 1997; Kinoshita et al.,
1999; Stark & Russell, 2003; Brakoulias & Jackson,
2004; Shulman-Peleg et al., 2004; Gold & Jackson, 2006).
Gold & Jackson (2006) conducted an all-against-all
comparison of 33,168 binding sites, the results of which have
been compiled into the SitesBase database. They have 
described several unexpected similarities across different 
protein folds and applied their method to the annotation 
of unclassified proteins. More recently, Minai et al. (2008)
compared all pairs of 48,347 potential ligand binding sites 
in 9708 representative protein chains, and demonstrated 
the applicability of ligand binding site comparison to drug 
discovery. 

To date, however, no method has been applied to 
the exhaustive all-against-all comparison of all ligand 
binding sites found in the Protein Data Bank (PDB)
(Berman et al., 2007), presumably because these methods
were not efficient enough to handle the huge amount of 
data in the current PDB, or because it was assumed that 
the redundancy (in terms of sequence homology) or some 
"trivial" ligands (such as sulfate ions) in the PDB did 
not present any interesting findings. As of June, 2008, the 
PDB contains over 51,000 entries with more than 180,000 
ligand binding sites excluding water molecules, and hence 
naively comparing all the pairs of this many binding sites 
(> 3 x 10^10 pairs) is indeed a formidable task. Nevertheless,
multiple structures of many proteins that have been
solved with a variety of ligands (e.g., inhibitors for en- 
zymes) could provide a great opportunity for analyzing 
the diversity of binding modes, and some apparently trivial
ligands are often used by crystallographers to infer the
functional sites from the "apo" structure. In other words, 
the diversity of these apparently redundant data is too 
precious a source of information to be ignored. 

To handle this huge amount of data, we have recently
developed the GIRAF (Geometric Indexing with Refined
Alignment Finder) method (Kinjo & Nakamura, 2007).
By combining ideas from geometric hashing
(Wolfson & Rigoutsos, 1997) and relational database
searching (Garcia-Molina et al., 2002), this method can
efficiently find structurally and chemically similar local 
protein structures in a database and produce alignments 
at atomic resolution independent of sequence homology, 
sequence order, or protein fold. In this method, we first 
compile a database of ligand binding sites into an ordi- 
nary relational database management system, and create 
an index based on the geometric features with surround- 
ing atomic environments. Owing to the index, potentially 
similar ligand binding sites can be efficiently retrieved and 
unlikely hits are safely ignored. For each of the potential 
hits found, the refined atom-atom alignment is obtained 
by iterative applications of bipartite graph matching and 
optimal superposition. In this study, we have further im- 
proved the original GIRAF method so that one-against-all 
comparison takes effectively one second, and applied it to 
the first all-against-all comparison of all ligand binding 
sites in the PDB. 
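
As a rough illustration of the indexing idea described above, the following sketch stores a quantized geometric key for every atom triplet of each binding site in a relational database (SQLite) and retrieves the sites that share keys with a query. It is not the GIRAF schema, which is not given in this text: the table layout, the `quantize` helper, the choice of sorted triangle side lengths as the key, and the 1 Å bin width are all hypothetical choices made for this example, and no refinement step (bipartite matching, superposition) is included.

```python
import itertools
import math
import sqlite3


def quantize(triangle, bin_width=1.0):
    """Rotation-invariant key: sorted, binned side lengths of an atom triplet."""
    a, b, c = triangle
    sides = sorted((math.dist(a, b), math.dist(b, c), math.dist(a, c)))
    return ",".join(str(int(s // bin_width)) for s in sides)


def site_keys(atoms):
    """Set of keys for all atom triplets of one site (hypothetical feature)."""
    return {quantize(t) for t in itertools.combinations(atoms, 3)}


def build_index(sites):
    """sites: dict mapping a site id to a list of (x, y, z) coordinates."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE feature (site_id TEXT, geom_key TEXT)")
    for site_id, atoms in sites.items():
        db.executemany(
            "INSERT INTO feature VALUES (?, ?)",
            [(site_id, key) for key in site_keys(atoms)],
        )
    # The index lets the database find sites by key without a full table scan.
    db.execute("CREATE INDEX feature_key ON feature (geom_key)")
    return db


def candidates(db, query_atoms, min_shared=1):
    """Ids of sites sharing at least min_shared keys with the query, best first."""
    keys = sorted(site_keys(query_atoms))
    marks = ",".join("?" * len(keys))
    rows = db.execute(
        f"SELECT site_id, COUNT(*) FROM feature WHERE geom_key IN ({marks}) "
        "GROUP BY site_id HAVING COUNT(*) >= ? "
        "ORDER BY COUNT(*) DESC, site_id",
        keys + [min_shared],
    )
    return [site_id for site_id, _ in rows]
```

In such a scheme only the sites returned by `candidates` would be passed on to the expensive atom-atom alignment, which is the sense in which unlikely hits are ignored.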

In order to extract recurring patterns in ligand binding
sites, we then classified the ligand binding sites based
on the results of the all-against-all comparison, and defined
structural motifs. So far, such structural motifs have
been determined either manually (Porter et al., 2004) or
automatically (Wangikar et al., 2003; Polacco & Babbitt,
2006). Given the huge amount of data, manual curation
of all potential motifs is not feasible, and previously developed
automatic methods are computationally too intensive
(Wangikar et al., 2003) or limited in scope (e.g., being
based on sequence alignment (Polacco & Babbitt, 2006)).
Therefore, we first applied divisive (top-down) hierarchical
clustering to obtain single-linkage clusters from the similarity
network of ligand binding sites, which can be readily
obtained from the result of the all-against-all comparison.
Based on the hierarchy of the single-linkage clusters,
agglomerative (bottom-up) complete-linkage clustering is
then applied. The complete-linkage clusters thus obtained are
shown to be well-defined structural motifs, and are then
subjected to statistical characterization regarding their
ligand specificity and protein folds.

Furthermore, based on the result of the all-against-all
comparison, we study the structure of the similarity network
of ligand binding sites, and enumerate interesting similarities
shared across different folds. The list of clusters
and the list of pairs of ligand binding sites not sharing the
same fold are available online.

1 http://pdbjs6.pdbj.org/~akinjo/lbs/

Fig. 1. Summary of experiment. A: Flow of the analysis. B: Histogram
of the number of matches per ligand binding site.

Results

All-against-all comparison of ligand binding sites

Out of 51,289 entries in the Protein Data Bank
(Berman et al., 2007) as of June 13, 2008, all 186,485
ligand binding sites were extracted and compiled into a
database. A ligand binding site is defined to be the set of
protein atoms that are within 5 Å of any of the corresponding
ligand atoms. To define a ligand, we used the
annotations in PDB's canonical XML (extensible markup
language) files (PDBML) (Westbrook et al., 2005) because
these annotations are more accurate than the HETATM
records of the flat PDB files. Our definition of ligands includes
not only small molecules, but also polymers such
as polydeoxyribonucleotide (DNA), polyribonucleotide
(RNA), polysaccharides, and polypeptides with fewer than
25 amino acid residues; water molecules and ligands consisting
of more than 1000 atoms were excluded. We did not
exclude "trivial" ligands such as sulfate (SO4^2-), phosphate
(PO4^3-), and metal ions. We did not use a representative
set of proteins based on sequence homology to reduce the
data size.

In total, the all-against-all comparison yielded 38,869,791
matches with P-value < 0.001, with 208 matches per site
on average (Fig. 1A). While 5014 sites found no hits other
than themselves, 8369 sites found more than 1000 matches.
When we limit the matches to more stringent P-value
thresholds (10^-10, 10^-15, 10^-20), the long tail of the large
number of matches rapidly disappears (Fig. 1B), indicating
that many matches reflect partial and weak similarities
between sites.
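
The definition of a ligand binding site used here (protein atoms within 5 Å of any ligand atom) can be sketched as follows. This is a minimal brute-force illustration, not the extraction code used for the database: the `(label, (x, y, z))` atom layout is assumed for the example, parsing of PDBML files is omitted, and treating the cutoff as inclusive is also an assumption.

```python
import math


def binding_site(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return the protein atoms within `cutoff` angstroms of any ligand atom.

    Each atom is a (label, (x, y, z)) tuple; this layout is an assumption
    made for the example.
    """
    return [
        (label, xyz)
        for label, xyz in protein_atoms
        if any(math.dist(xyz, lig_xyz) <= cutoff for _, lig_xyz in ligand_atoms)
    ]


# Toy example with made-up coordinates: only the first two atoms are near the ion.
protein = [("CA", (0.0, 0.0, 0.0)), ("CB", (4.0, 0.0, 0.0)), ("CG", (11.0, 0.0, 0.0))]
ligand = [("ZN", (1.0, 0.0, 0.0))]
site = binding_site(protein, ligand)
```

The double loop costs time proportional to the product of the two atom counts; for whole-database processing a spatial grid or k-d tree would normally replace it.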



Fig. 2. Relationship between sequence similarity and ligand binding site similarity. A: Sequence identity of BLAST hits versus GIRAF 
P-values. B: Sequence identity of BLAST hits versus root-mean-square deviation of aligned ligand binding sites found by GIRAF. C: Sequence 
identity of BLAST hits versus the number of ligand binding site atoms aligned by GIRAF. 

Relationship between similarities of protein sequences and ligand binding sites

As noted above, the present data
set is highly redundant in terms of sequence homology. If
the similarity of ligand binding sites were strongly correlated
with that of amino acid sequences, it would have been better
to use sequence representatives. To justify the use of
the redundant data set, we carried out an all-against-all
BLAST (Altschul et al., 1997) search of all protein chains
of the present data set, and checked the correlation between
sequence identity and GIRAF P-value (Fig. 2A). It should
be noted that a ligand binding site may reside at an interface
of two or more protein subunits (chains), which
complicates the notion of representative chains. Therefore,
we defined sequence similarity between two PDB entries as
the maximum sequence identity of all the possible pairs of
chains from the two PDB entries.
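
The entry-level sequence similarity defined above can be written compactly. In this sketch the chain identifiers and the `identity` lookup table (percent identity of BLAST hits keyed by chain pair) are hypothetical, and chain pairs without a BLAST hit are assumed to contribute 0.

```python
from itertools import product


def entry_identity(chains_a, chains_b, identity):
    """Maximum percent identity over all chain pairs of two PDB entries.

    `identity` maps (chain_a, chain_b) to the percent identity of a BLAST
    hit; pairs without a hit count as 0.0 (an assumption of this sketch).
    """
    return max(
        (identity.get((a, b), 0.0) for a, b in product(chains_a, chains_b)),
        default=0.0,
    )


# Hypothetical chains and identities, for illustration only.
hits = {("1abc_A", "2xyz_A"): 35.0, ("1abc_B", "2xyz_A"): 92.5}
value = entry_identity(["1abc_A", "1abc_B"], ["2xyz_A"], hits)
```

Taking the maximum means that two entries are regarded as closely related whenever any one pair of their chains is, regardless of which chains form the binding site.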

While there was a significant but very weak negative 
correlation between the GIRAF P-value and percent se- 
quence identity (Pearson's correlation -0.14), there were 
many strikingly similar (GIRAF P-value < 10^-50) pairs of
ligand binding sites with low (< 30%) sequence identity,
and there were also many weakly similar ligand binding
sites (GIRAF P-value > 10^-20) in the high (> 90%) sequence
identity region. This tendency was also confirmed by using
more conventional measures of similarities. Although the 
root-mean-square deviation (RMSD) of aligned atoms ex- 
hibited a stronger negative correlation with the sequence 
identity (Fig. 2B; Pearson's correlation -0.46), the range 
of scatter of RMSD was so large that it was not possible 
to distinguish the range of sequence identity from RMSD 
values and vice versa. In addition, the number of aligned 
atoms did not correlate with the sequence identity (Fig. 
2C), indicating that the local structures of ligand binding 
sites can be strictly conserved among distantly related pro- 
teins. Visual inspection suggested a few possible reasons for 
the large deviation in the region of high sequence similar- 
ity. First, the binding sites do not necessarily overlap com- 
pletely when different ligands are complexed with (almost) 
identical proteins. Second, many binding sites are flexible, 
yet they are able to bind the same ligand. Third, some lig- 
ands are flexible and can be bound as different conformers, 
which in turn causes structural changes of the binding site. 

One of the rationales for an exhaustive all-against- 
all comparison is that some similarities between non- 
representative proteins would be ignored when only se- 
quence representatives were used. For example, in the 
results of a comparison of potential ligand binding sites 
of 9708 sequence representative proteins conducted by
Minai et al. (2008), the similarity between the ADP binding
sites of human inositol (1,4,5)-triphosphate 3-kinase
(PDB: 1W2D (Gonzalez et al., 2004); SAICAR synthase-like
fold) and of Archaeoglobus fulgidus Rio2 kinase (PDB:
1ZAR (Laronde-Leblanc et al., 2005); Protein kinase-like
fold) was not detected, although this match was found to
have a P-value of 8.1 x 10^-17 (40 aligned atoms; RMSD
0.75 Å) in the present result. Furthermore, equivalent
matches were not found in all homologs of these two proteins.
We note, however, that Minai et al. (2008) did find
an equivalent similarity between the binding sites of these
protein folds, but it was based on apo structures, which
were not treated here. Thus, the similarity not detected by
Minai et al. is likely to be due to the use of representatives
rather than to a difference in sensitivity between their method
and the present one.

We conclude that the similarity of sequences and that 
of ligand binding site structures are weakly correlated, but 
the correlation is not strong enough to infer the one from 
the other. 



Defining structural motifs of ligand binding sites

We
have seen that sequence representatives are not suitable for 
studying the diversity of ligand binding sites. The use of the 
raw data of ligand binding sites for statistical analysis, how- 
ever, would be problematic due to some over-represented 
and under-represented binding sites. Therefore, it is prefer- 
able to remove the redundancy based on the ligand binding 
similarity itself. Furthermore, a list of pairwise similarities 
is not sufficient for characterizing typical patterns of bind- 
ing modes. Accordingly, we applied the hybrid top-down- 
bottom-up clustering method to obtain complete-linkage 
clusters based on P-value. In a complete-linkage cluster
(hereafter referred to as 'cluster'), every pair of its members
is similar within the specified P-value threshold. As
such, clusters may be regarded as precisely defined structural
motifs of ligand binding sites, and hence we use the
terms 'cluster' and 'structural motif' (or simply 'motif')
interchangeably when appropriate. Based on the analysis of
similarity networks with varying thresholds (see below), we
set the threshold to 10^-15 in the following analysis.
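
The two-stage idea behind the hybrid clustering can be illustrated with a simplified sketch. The details of the actual procedure are not given in this section, so the code below is an assumption-laden stand-in: it takes the connected components of the graph of significant matches as the single-linkage clusters at one fixed threshold (rather than building a divisive hierarchy), and then runs a naive agglomerative complete-linkage merge inside each component. The function names and the `pvalue` dictionary keyed by site pairs are hypothetical; pairs with no recorded match are treated as P = 1.

```python
from itertools import combinations


def components(nodes, pvalue, threshold):
    """Single-linkage step: connected components of the graph with P < threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), p in pvalue.items():
        if p < threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def complete_linkage(members, pvalue, threshold):
    """Agglomerative step: merge while every cross pair has P < threshold."""
    def p(a, b):
        return pvalue.get((a, b), pvalue.get((b, a), 1.0))  # no match: P = 1

    def worst(c1, c2):
        return max(p(a, b) for a in c1 for b in c2)

    clusters = [[m] for m in members]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: worst(clusters[ij[0]], clusters[ij[1]]),
        )
        if worst(clusters[i], clusters[j]) >= threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i remains valid
        clusters[i].extend(merged)
    return clusters


def hybrid_clusters(nodes, pvalue, threshold=1e-15):
    """Complete-linkage clusters computed separately within each component."""
    result = []
    for comp in components(nodes, pvalue, threshold):
        result.extend(complete_linkage(comp, pvalue, threshold))
    return result
```

Restricting the quadratic complete-linkage step to one component at a time is what makes the bottom-up stage affordable: sites in different components can never satisfy the all-pairs criterion, so they need not be compared.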

It is immediately evident that there are a large number
of small clusters and a small number of large clusters (Fig.
3A). Excluding 58,001 singletons (clusters with only one
member), there were 20,224 clusters, which accounted for
128,484 (69%) of all the 186,485 sites. Out of these clusters,
2959 clusters consisted of at least 10 sites, accounting for
69,748 (37%) sites. The list of these clusters of structural
motifs is available online. Since the ligand binding sites
in small clusters are not reliable due to statistical errors,
we use only the 2959 clusters consisting of at least 10 sites
in the following analysis unless otherwise stated.

Fig. 3. Statistical properties of structural motifs. A: Size of complete-linkage clusters defined with a P-value threshold of 10^-15. B: Scatter plot 
of cluster size versus ligand types found in cluster. C: Histogram of the number of ligand types per structural motif (cluster). D: Histogram 
of the number of structural motifs (clusters) associated with a given ligand type. E: 30 most abundant ligand types (polymer molecules are 
marked with an asterisk). 



Diversity of structural motifs with respect to ligand types 

Although some structural motifs included binding sites 
for a wide variety of ligand types, this is not always the 
case (Fig. 3B). Here, each PDB chemical component iden- 
tifier (consisting of 1 to 3 letters) corresponds to a ligand 
type except for peptides, nucleic acids or sugars, which 
were treated simply as such (i.e., polymer sequence iden- 
tity is ignored). Large clusters associated with many kinds 
of ligands were almost always enzymes such as proteases 
(eukaryotic or retroviral), carbonic anhydrases, protein 
kinases and protein phosphatases, whose structures have 
been solved with a variety of inhibitors. For example, two 
structural motifs consisting of 246 and 147 ligand binding 
sites of eukaryotic (trypsin-like) proteases were associated 
with 106 and 80 ligand types, respectively; two motifs con- 
sisting of 197 and 115 sites of retroviral proteases with 82 
and 62 ligand types, respectively; a motif of 63 sites of protein
kinases with 58 ligand types. In contrast, large
clusters with a limited variety of ligands were binding sites
for heme (globins and nitric oxide synthase oxygenases) or
metal ions. Each structural motif is associated with 3.2 ligand
types on average (standard deviation 5.3): 1322 motifs
(47%) with only one ligand type and 2807 motifs (95%)
with fewer than 10 ligand types, whereas only 34 motifs contained
more than 20 ligand types (Fig. 3C). In general, the
diversity of ligand types per structural motif is low.

The converse is also true. That is, the number of struc- 
tural motifs associated with each ligand type is generally 
very limited with the average of 2.1 motifs (standard devi- 
ation 8.4) per ligand type (Fig. 3D), and 3791 ligand types 
correspond to single motifs. Nevertheless, there were some 
ligands which were associated with many motifs (Fig. 3E). 
As expected, ligands often included in the solvent (e.g.,
SO4 [sulfate], MG [magnesium ion], GOL [glycerol], EDO
[ethanediol]) were found in many motifs. Reflecting the large
number of possible sequences, polymer molecules including
peptides, sugars, and DNA were each also found to be
associated with many motifs. Other than these, mononucleotides,
dinucleotides, and metal ions exhibited a wide
range of binding modes.



http://pdbjs6.pdbj.org/~akinjo/lbs/cluster.xml



Diversity with respect to protein families and folds 
Not many, but some structural motifs were found to contain
ligand binding sites of distantly related proteins. To
quantitatively analyze the diversity of structural motifs in
terms of homologous families and global structural similarities,
we assigned protein family, superfamily, fold, and
class to each structural motif according to the SCOP
(Murzin et al., 1995) database. More concretely, the most
specific SCOP code (SCOP concise classification string,
SCCS) that was shared by all members of the corresponding
cluster was assigned to each motif when possible;
otherwise (i.e., when at least one member differs from the
other members of the cluster at the class level), the
motif was categorized as "others" (Fig. 4A).
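
Since an SCCS has the form class.fold.superfamily.family (e.g., c.37.1.8), the assignment rule above amounts to taking the longest common prefix of the members' codes. The following sketch illustrates this; the function name is hypothetical, and members lacking a SCOP annotation are assumed to have been filtered out beforehand.

```python
def common_sccs(codes):
    """Most specific SCOP level shared by all SCCS strings of a cluster.

    Returns the shared prefix (e.g. 'c.37' when members share only the fold)
    or 'others' when members already differ at the class level.
    """
    split = [code.split(".") for code in codes]
    shared = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        shared.append(parts[0])
    return ".".join(shared) if shared else "others"
```

For instance, `common_sccs(["c.37.1.1", "c.37.1.8"])` gives `"c.37.1"` (same superfamily), whereas `common_sccs(["c.37.1.1", "c.91.1.1"])` gives `"c"` (same class only, i.e., a cross-fold motif).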

Out of 2705 motifs to which SCCS can be assigned, 2637 
and 62 motifs shared the same domains at the family and 
superfamily level, respectively. Thus, more than 99% of the 
motifs (of at least 10 binding sites) only contained binding 
sites of evolutionarily related proteins. One motif contained 
proteins from different superfamilies but of the same fold. 
This motif corresponded to the heme binding site of heme-binding
four-helical bundle proteins (SCOP: f.21). Five motifs
accommodated similarities across different folds, out of
which three were zinc binding motifs (Krishna et al., 2003).
One motif contained a P-loop motif which is shared between
the P-loop containing nucleotide triphosphate hydrolases
(NTH) (SCOP: c.37) and the PEP carboxykinase-like
fold (SCOP: c.91) (Fig. 5A) (Tari et al., 1996). One motif
was of the nucleotide-binding sites from the FAD/NAD(P)-binding
domain (SCOP: c.3) and the Nucleotide-binding domain
(SCOP: c.4) (Fig. 5B). Note that some PDB entries
have not yet been annotated in SCOP. Currently, if such 
members exist in a cluster, they are simply ignored, and 
the assigned SCCS is based only on the members whose 
SCCS is known. Therefore, the number of motifs not shar- 
ing the same folds is somewhat underestimated. Neverthe- 
less, it seems a general tendency that most motifs are con- 
fined within homologous proteins, namely families or su- 
perfamilies. 

It was shown above that sequence similarity was only 
weakly related to the structural similarity of ligand bind- 
ing sites (Fig. 2). This point can be further clarified by 
examining motifs of similar binding sites of related proteins.
For example, the peptide binding sites of a pig
trypsin (PDB: 1UHB (Pattabhi et al., 2004)) and of a human
hepsin (PDB: 1Z8G (Herter et al., 2005)) were both
in the same cluster but they share little sequence similarity
(5% sequence identity based on a structural alignment
(Kawabata & Nishikawa, 2000; Kawabata, 2003)), while
the peptide binding site of bovine trypsin (PDB: 1QB1
(Whitlow et al., 1999)) in another cluster shares 81% sequence
identity with the pig trypsin in the previous
cluster. This observation can be explained by the fact that
different motifs cover different regions of proteins even 
though they are spatially close or even partially overlap- 
ping. The same argument applies to other motifs of related 
proteins. Thus, the structural motifs distinguish subtle 
differences in ligand binding site structures independent of 
sequence similarity. 

It has been known that some protein folds can accom- 
modate a wide range of functions. It is expected that the 
diversity of function is reflected in that of structures of lig- 
and binding sites. To analyze such tendency, we counted 
the number of motifs that belong to each protein fold (Fig. 
4B). Only a handful of folds showed a large diversity in 
terms of structural motifs. On average, 8.9 motifs were assigned
to a fold. Out of 332 folds used in the analysis,
only 18 contained more than 30 motifs (Fig. 4C). Among
them, the TIM barrel fold was an extreme case with 183
motifs assigned, reflecting the great diversity of its functions
(Nagano et al., 2002). Some superfolds (Orengo et al.,
1994) such as Rossmann-fold, immunoglobulin-like, globin-like,
etc. also showed great diversities of ligand binding
sites.

Fig. 4. Diversity of structural motifs in terms of protein folds. A: 
Number of motifs to which the given SCOP (Murzin et al., 1995) 
hierarchical level (family, superfamily, fold, class) can be assigned. B: 
Histogram of the number of structural motifs associated with each 
SCOP fold. C: 20 most diverse SCOP folds in terms of the number 
of associated structural motifs. 

Folds labeled in panel C:
TIM beta/alpha-barrel [c.1] 
P-loop containing NTP hydrolases [c.37] 
NAD(P)-binding Rossmann-fold [c.2] 
Immunoglobulin-like beta-sandwich [b.1] 
Ferredoxin-like [d.58] 
Phosphorylase/hydrolase-like [c.56] 
PLP-dependent transferases [c.67] 
Protein kinase-like (PK-like) [d.144] 
Ferritin-like [a.25] 
EF Hand-like [a.39] 
Trypsin-like serine proteases [b.47] 
Zincin-like [d.92] 
Concanavalin A-like lectins [b.29] 
Globin-like [a.1] 
Cupredoxin-like [b.6] 
alpha/beta-Hydrolases [c.69] 
FAD/NAD(P)-binding domain [c.3] 
Heme-dependent peroxidases [a.93] 
Flavodoxin-like [c.23] 
Bacterial photosystem II [f.26] 



Fig. 5. Examples of structural motifs shared by different protein folds. 
The left panel shows the whole protein structures (colored in blue 
or pink) superimposed based on the alignment of the ligand binding 
sites shown in the right panel (colored in the CPK scheme or magenta 
(protein) and green (ligand), respectively). A: ADP binding 
site of bacterial shikimate kinase (PDB: 2DFT (Dias et al., 2007); 
SCOP: c.37; blue / CPK-colored) and ATP binding site of bacterial 
phosphoenolpyruvate carboxykinase (PDB: 1AQ2 (Tari et al., 1997); 
SCOP: c.91; pink / protein in magenta, ADP in green). B: FAD binding 
site of human glutathione reductase (PDB: 5GRT (Stoll et al., 
1997); SCOP: c.3; blue / CPK-colored) and ADP binding site of bacterial 
trimethylamine dehydrogenase (PDB: 2TMD (Barber et al., 
1992); SCOP: c.4; pink / protein in magenta, ADP in green). 

Similarity network of ligand binding sites

While each
motif defines a precise pattern of ligand binding mode,
the members of different structural motifs share significant
structural similarities with each other. To explore the
global structure of the 'ligand binding site universe,' we
constructed a similarity network based on the results of
the all-against-all comparison. Each structural motif was
represented as a node and two nodes were connected if a
member of one node was significantly similar to a member
of the other node (i.e., the P-value of their alignment was
below a predefined threshold). The network thus constructed
can be decomposed into a number of connected components.
When the threshold was greater than 10^-14, the size
of the largest connected component of the network was one
or two orders of magnitude greater than that of the second
largest one (Fig. 6A). For example, setting the threshold
to 10^-10 yielded the largest connected component consisting
of 78,190 sites (i.e., 42% of 186,485 sites). Accordingly,
many functionally unrelated binding sites were somehow
connected in the largest component, which complicated the
interpretation of the component. With a P-value threshold
of 10^-15 or less, the first several connected components
were of the same order of magnitude in size (Fig. 6A), and
many members of each component appeared to be more
functionally related. Thus, we set P = 10^-15 for constructing
the network in the following (as well as for defining the
complete-linkage clusters described above).
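
The dependence of the component sizes on the threshold can be explored with a few lines of code. The sketch below is illustrative only: nodes stand for motifs, `pvalue` is a hypothetical dictionary holding, for each connected pair of nodes, the best P-value between any of their members, and components are found by breadth-first search.

```python
from collections import defaultdict, deque


def component_sizes(nodes, pvalue, threshold):
    """Sizes of connected components, largest first, using edges with P < threshold."""
    adj = defaultdict(set)
    for (a, b), p in pvalue.items():
        if p < threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adj[node] - seen:  # unvisited neighbours only
                seen.add(nxt)
                queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy network with made-up P-values: loosening the threshold merges components.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_p = {("a", "b"): 1e-20, ("b", "c"): 1e-12, ("d", "e"): 1e-16}
loose = component_sizes(toy_nodes, toy_p, 1e-10)
strict = component_sizes(toy_nodes, toy_p, 1e-15)
```

Comparing the first few entries of the returned list across thresholds is one way to locate the point at which a single giant component starts to dominate.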

Excluding 54,092 singleton components (those consisting 
of only one site), 11,532 connected components were found. 
The largest component consisted of 7935 sites, and 1881 
components contained at least 10 sites (Fig. 6A). 

The main constituents of the largest connected component
of the similarity network (Fig. 6B) were mononucleotide
(ADP, GDP, etc.) or phosphate (PO4) binding
sites. Most notable were P-loop containing NTH (SCOP:
c.37) and PEP carboxykinases (SCOP: c.91), which formed
a closely connected group as they share similar phosphate
binding sites, i.e., the P-loop motif (the term 'group' used
here indicates closely connected clusters in a network component,
colored in green in Fig. 6B-F). Directly connected
with this group was the coenzyme A (CoA) binding site of
acetyl-CoA acetyltransferases. The magnesium ion (MG)
binding sites of Ras-related proteins were also connected
with the group of the P-loop containing proteins since the
magnesium ion is often located near the phosphate binding
site. Mononucleotide or phosphate (AMP, U5P, PRP, PO4)
binding sites of various phosphoribosyltransferases and the
flavin mononucleotide (FMN) binding site of flavodoxins
were also closely connected. The phosphate binding site of
tyrosine-protein phosphatases formed another group which
was weakly connected to the FMN binding site of flavodoxins.

It is surprising that the heme binding site of globins 
(hemoglobins, myoglobins, cytoglobins, etc.) was also in- 
cluded in this component. Nevertheless, it was not directly 
connected to the main group of P-loops, but indirectly via 
the sparse group consisting of chloride ion binding site of 
T4 lysozymes and sulphate and phosphate binding sites 
of miscellaneous proteins. The binding sites of this latter 
group were made of regular structures at the termini of α-helices.
When we used a more stringent P-value threshold
(say, 10^-20), the groups of globins and lysozymes were detached
from the main group, but the main group containing
the P-loops was almost unaffected (data not shown).
Thus, the matches connecting globins, lysozymes, and P-loop
containing proteins may be considered as 'false' hits.
Based solely on structural similarity, however, they are difficult
to discriminate from 'true' hits (structural matches
between functionally related sites) since many functional
sites often include regular structures at termini of secondary
structures. Nevertheless, the fact that only a subset
of regular structures were detected suggests that these
matches may correspond to recurring structural patterns
often used as building blocks of functional sites. In addition,
we point out that weak but meaningful enzymatic functions
are sometimes detected experimentally in such 'false' hits
(Ikura et al., 2008).



The second largest connected component mainly consisted
of mononucleotide or dinucleotide binding sites
of the so-called Rossmann-like fold domains (Fig. 6C),
which include, among others, NAD(P)-binding Rossmann-fold
domains (SCOP: c.2), FAD/NAD(P)-binding domain
(SCOP: c.3), nucleotide-binding domain (SCOP: c.4),
SAM-dependent methyltransferases (SCOP: c.66), activating
enzymes of the ubiquitin-like proteins (SCOP: c.111),
and urocanase (SCOP: e.51).

Peptide (and inhibitor) binding sites of trypsin-like and 
subtilisin-like proteases were found in the third largest com- 
ponent (Fig. 6D). These two proteases do not share a com- 
mon fold, but were connected due to the similarity of the 
active site structures around the well-known catalytic triad. 

The EF hand motif, a major calcium binding motif, was
found in the fourth largest component (Fig. 6E), in which a
variety of other calcium ion binding sites were also found.
Although the main group in this component mainly consisted
of the calcium ion binding sites of various calmodulin-like
proteins, it also contained similar sites of periplasmic
binding proteins (PBPs). The ligands of these PBPs also include
sodium in addition to calcium ions. The main group
was weakly connected to calcium ion binding sites of proteins
of completely different folds such as galactose-binding
domains (e.g., galactose oxidase, fucolectins), laminin G-like
modules (e.g., laminin, agrin, etc.), alpha-amylases, annexins,
and phospholipase A2. Due to its spatial proximity,
the calcium binding site of phospholipase A2 was also
connected to its inhibitor binding sites.

Fig. 6. Networks of structural motifs of ligand binding sites. A: Distribution of the size of connected components of the similarity network 
with varying P-value thresholds. A transition is observed at P = 10^-15. B-F: The five largest connected components of the similarity network 
(P-value threshold = 10^-15). Some groups of structural motifs are marked by black circles annotated with ligand types and protein folds. To 
facilitate visualization, each node (shown as a sphere) is represented as a complete-linkage cluster (structural motif) of ligand binding sites 
defined with P = 10^-15 (the sphere size is proportional to the cluster size). Nodes and edges are colored according to the values of their 
clustering coefficient (green: high; magenta: low) (Watts & Strogatz, 1998). 

Our last example, the fifth largest component, exhibited
an exploding structure (Fig. 6F). Nevertheless, most binding
sites are associated with nucleotides. The main closely
connected group consisted of the ATP (and inhibitor)
binding sites of protein kinase family proteins, next to
which the ADP binding sites of glutathione synthetase family
proteins (including D-ala-D-ala ligases) were connected.
Other closely connected groups included FAD binding sites
of ferredoxin reductase-like proteins, ATP or magnesium
binding sites of adenine nucleotide alpha hydrolases-like
proteins, inhibitor binding sites of nitric-oxide synthases,
and NAD (analog) binding sites of ADP-ribosylation proteins
(e.g., T-cell ecto-ADP-ribosyltransferase 2, iota toxin).
There was a large sparse group connected with the
Fig. 7. Ligand binding sites shared across different protein folds. A:
The 20 most common pairs of different folds sharing significant ligand
binding site similarities. B: The 20 most common pairs of ligand types
shared across different folds.

main group of protein kinases. In that sparse group, ligand
binding sites of transthyretins (prealbumins) were often found
to be directly connected with those of protein kinases, although
their folds are different. Both binding sites involve a face of
a β-sheet, and their similarity was found to be due to the
backbone conformation of the β-sheet. Since many proteins bind
their ligands on a face of a β-sheet, this observation in turn
explains the origin of the large sparse group.



Significant similarities across different folds The similarity
network of ligand binding sites revealed many structural
similarities across different folds. To explore the extent of
significant 'cross-fold' similarities (with P < 10^-15), we
assigned SCOP codes to as many structural motifs as possible,
and enumerated motif pairs whose members were significantly
similar but did not share a common fold (Fig. 7A). We also
examined the ligand pairs in those matches, and found that most
of them were reasonable matches (Fig. 7B): metal ions were
matched with metal ions, nucleotides with nucleotides or
phosphate, and so on. Thus, many of these cross-fold
similarities are expected to be functionally relevant. The
observation that sulfate (SO4) binding sites were often matched
with mononucleotide (GDP and ATP) or phosphate (PO4) binding
sites (Fig. 7B) confirms the usefulness of the former ligand in
inferring the binding of the latter ligands, as often practiced
by crystallographers. We note that multiple SCCS codes may be
assigned to a single motif if it contains multiple fold types
or if its member sites are located at an interface of multiple
domains. In order to cover all possible fold pairs, we did
not exclude motifs consisting of fewer than 10 binding sites
in this analysis. In total, there were 4035 pairs of structural
motifs (52,709 pairs of binding sites) that exhibited
significant similarities but did not share the same fold. The
complete list of these pairs is available online (URL given in
the footnote).

The most common cross-fold similarity was found between the
P-loop containing nucleoside triphosphate hydrolases (NTH;
SCOP: c.37) and the PEP-carboxykinase-like fold (SCOP: c.91)
(cf. Fig. 5A). As described in the analysis of complete-linkage
clusters, this corresponds to mononucleotide or phosphate
binding sites.

Mononucleotide or dinucleotide binding sites of various 
Rossmann-like folds (SCOP: c.2, c.3, c.4, c.66) also exhib- 
ited significant mutual similarities (e.g., Figs. 5B and 8A). 

The calcium binding sites of the EF hand-like fold (SCOP: a.39)
were found to be similar to the metal binding sites of many
folds including beta-propeller proteins (Fig. 8B), periplasmic
binding proteins (SCOP: c.93 [class I], c.94 [class II]),
lysozyme-like (SCOP: d.2), Zincin-like (SCOP: d.92), and
many others.

Similar zinc binding sites were found in many, mostly
small, folds in addition to the DHS-like NAD/FAD-binding
domain (SCOP: c.31) and Rubredoxin-like (SCOP: g.41) folds
(Fig. 8C), the former of which may be regarded as an inserted
zinc finger motif.

The similarity between the globin-like (SCOP: a.1) and
ferredoxin-like (SCOP: d.58) folds was due to the coordinated
structures of the iron-sulfur clusters found in alpha-helical
ferredoxins and ferredoxins, respectively.

HAD-like fold proteins (SCOP: c.108) and CheY-like 
(flavodoxin fold) proteins (SCOP: c.23) often share similar 
binding sites (e.g., Fig. 8D). Interestingly, although these 
proteins have very similar topologies, the orders of aligned 
secondary structure elements were different when the align- 
ment was based on the ligand binding site similarity. 

As noted in the description of a network component (Fig.
6F), protein kinases and transthyretins share similar binding
sites, which are located on a face of a β-sheet (Fig. 8E).
Nevertheless, their ligand moieties also seem similar.

Also as seen in the network component (Fig. 6B), the phosphate
binding site of the P-loop motif exhibits a significant
similarity with the CoA binding site of acetyltransferases (Fig.
8F). A close examination showed that the phosphate bound to
the P-loop motif coincided with the phosphate group of
CoA bound to the acetyltransferase.

The list of the cross-fold similarities contained many
other examples including, but not limited to, those discussed
in the context of the similarity network. Here we give two
other examples. Bacterial peptide deformylase 2 (SCOP: d.167)
and human macrophage metalloelastase (SCOP: d.92) both act on
peptides, and their ligand binding sites exhibit high
structural similarity (Fig. 8G). DNA is one of the most
abundant ligands found in cross-fold similarities (Fig. 7B).
Not surprisingly, similarities can also be found between
binding sites for DNA and RNA. One example is the pair formed
by the KH1 domain of human poly(rC)-binding protein 2, which
binds DNA, and bacterial transcription elongation protein NusA,
which binds RNA (Fig. 8H). These proteins have different
variants of the KH domain (Grishin, 2001b).



http://pdbjs6.pdbj.org/~akinjo/lbs/diffold.xml



Discussion From the result of the exhaustive all-against- 
Fig. 8. Examples of ligand binding sites shared across different folds. The color schemes are the same as in Fig. 5. A: AMP binding
site of Thermotoga maritima hypothetical protein tm1088a (PDB: 2G1U (Joint Center for Structural Genomics, 2006); SCOP: c.2; blue
/ CPK-colored) and SAM binding site of human putative ribosomal RNA methyltransferase 2 (PDB: 2NYU (Wu et al., 2006); SCOP:
c.66; pink / protein in magenta, SAM in green). B: Calcium binding sites of Clostridium thermocellum cellulosomal scaffolding
protein A (PDB: 2CCL (Carvalho et al., 2007); SCOP: a.39; blue / CPK-colored) and human integrin alpha-IIb (PDB: 1TXV (Xiao et al.,
2004); SCOP: b.69; pink / protein in magenta, calcium in green). C: Zinc binding sites of human NAD-dependent deacetylase (PDB:
2H4H (Hoff et al., 2006); SCOP: c.31 [inferred by SSM (Krissinel & Henrick, 2004)]; blue / CPK-colored) and Bacillus stearothermophilus
adenylate kinase (PDB: 1ZIN (Berry & Phillips Jr., 1998); SCOP: g.41; pink / protein in magenta, zinc in green). D: Formic acid
binding site of Xanthobacter autotrophicus L-2-haloacid dehalogenase (PDB: 1AQ6 (Ridder et al., 1997); SCOP: c.108; blue / CPK-colored)
and BeF3- binding site of Escherichia coli PhoB (PDB: 1ZES (Bachhawat et al., 2005); SCOP: c.23; pink / protein in magenta, BeF3-
in green). E: 3,5-diiodosalicylic acid binding site of human transthyretin (PDB: 3B56; SCOP: b.3; blue / CPK-colored) and inhibitor
(N-[3-(4-fluorophenoxy)phenyl]-4-[(2-hydroxybenzyl)amino]piperidine-1-sulfonamide) binding site of human mitogen-activated protein kinase
14 (PDB: 1ZZ2; SCOP: d.144; pink / protein in magenta, inhibitor in green). F: Phosphate binding site of Pyrococcus furiosus Rad50
ABC-ATPase (PDB: 1II8; SCOP: c.37; blue / CPK-colored) and coenzyme A (CoA) binding site of Salmonella typhimurium LT2 acetyltransferase
(PDB: 1S7N; SCOP: d.108; pink / protein in magenta, CoA in green). G: Actinonin binding site of B. stearothermophilus peptide deformylase
2 (PDB: 1LQY (Guilloteau et al., 2002); SCOP: d.167; blue / CPK-colored) and NNGH binding site of human macrophage metalloelastase
(PDB: 1Z3J; SCOP: d.92; pink / protein in magenta, NNGH in green). H: DNA binding site of the KH1 domain of human poly(rC)-binding
protein (PDB: 2AXY (Du et al., 2005); SCOP: d.51; blue / CPK-colored, DNA in orange) and RNA binding site of Mycobacterium tuberculosis
transcription elongation protein NusA (PDB: 2ATW (Beuth et al., 2005); SCOP: d.52; pink / protein in magenta, RNA in green).



all comparison, we were able to obtain an extensive list
of ligand binding site similarities irrespective of sequence
homology or global protein fold. The similarity network
uncovered a great many cross-fold similarities in addition to
well-known ones. Although it is still not clear how many of
these similarities are functionally relevant, it was often
observed that different folds were superimposable to a
significant extent when the alignment was based on the
ligand binding sites (e.g., Figs. 5, 8). Aligning protein
structures based on ligand binding sites (or functional sites
in general) irrespective of sequence similarity, sequence
order, and protein fold (as currently defined) may be a
useful approach to elucidating the evolutionary history
of fold changes (Grishin, 2001a; Krishna & Grishin, 2004;
Andreeva & Murzin, 2006; Taylor, 2007; Goldstein, 2008;
Xie & Bourne, 2008).

As was seen in the similarity network (Fig. 6), some links
are based on the similarity of highly regular (secondary)
structures which are found in many protein structures (e.g.,
Fig. 6B,F). While such similarities may not be directly
related to any biochemical functions, they suggest that many
ligand binding sites are based on combinations of some regular
local structures. It is known that a relatively small
library of backbone fragments can accurately model tertiary
structures of proteins (Kolodny et al., 2002). Consequently,
the variety of contiguous fragments recurring in ligand
binding sites is also limited as far as backbone structure is
concerned. Friedberg & Godzik (2005) found
similarities across different protein folds including those
involved in various zinc-finger motifs and Rossmann-like
folds, as shown in this study. They also showed significant
correlations between similarity of fragments and that of
protein functions. This observation is consistent with the
present results in that it suggests that specific combinations
of fragments encode specific functions. To apply the
GIRAF method to functional annotations, however, it is
preferable to discriminate functionally relevant similarities
from purely structural similarities.

Some of the shortcomings of simple pairwise comparison
may be overcome by the complete-linkage clustering analysis
of similar binding sites, which allowed us to define precise
structural motifs. It should be stressed that defining reliable
motifs requires redundancy in the PDB (Wangikar et al.,
2003); otherwise it would be more difficult to distinguish
recurring structures from incidental matches. These motifs
may be useful for defining structural templates for efficient
motif matching (Wallace et al., 1997). Despite the diversity
of binding sites and their similarities, most motifs were
found to be confined within single families or superfamilies,
and they were also found to be highly specific to particular
ligands. Thus, these motifs may be helpful for annotating
putative functions of proteins, especially those of structural
genomics targets.

In conclusion, the development of an extremely efficient
search method (GIRAF) to detect local structural similarities
made it possible to conduct the first exhaustive
all-against-all comparison of all ligand binding sites in
all the known protein structures. We identified a number
of well-defined structural motifs and enumerated many
non-trivial similarities. While exhaustive pairwise comparisons
are useful for detecting weak and possibly partial similarities
between ligand binding sites, the significance of such
matches may not be immediately obvious because some of
them may be based on ubiquitous regular structures.

Meanwhile, complete-linkage clusters of ligand binding
sites are useful for identifying functionally relevant binding
site structures, but they may neglect partial but significant
matches. Therefore, these two approaches, exhaustive
pairwise comparison and motif matching, are complementary
to each other, and hence their combination
may be helpful for more reliable annotations of proteins
with unknown functions. These approaches may be further
supplemented by other existing fold and/or sequence-based
methods (Standley et al., 2008; Xie & Bourne, 2008). The
present method can also be applied to a whole protein
structure (not limited to its predefined ligand binding sites)
to find potential ligand binding sites (Kinjo & Nakamura,
2007). In this way, we are currently annotating all structural
genomics targets (Chen et al., 2004). We also plan to
make this method available as a web service so that structural
biologists can routinely search for ligand binding sites
of interest.



Experimental Procedures



The GIRAF method The details of the original GIRAF
method have been published elsewhere (Kinjo & Nakamura,
2007). In this study, an improved version of GIRAF was
used for conducting the all-against-all comparison. The
improvements include more sensitive geometric indexing with
atomic composition around each reference set, simplified
SQL expressions, and parallelization (A.R.K. and H.N.,
unpublished).



All-against-all comparison Ligand binding sites were
extracted from PDBML files as described in Results. A ligand
was defined as any entity annotated neither as
"polypeptide (L)" with more than 24 amino acid residues
nor as "water" in the entity category. That is, a ligand can be
a polypeptide shorter than 25 residues, DNA, RNA,
polysaccharides (sugars), lipids, metal ions, iron-sulfur
clusters, or any other small molecule. However, ligands with
more than 1000 atoms were discarded. The all-against-all
comparison was carried out on a cluster machine consisting of
20 nodes of 8-core processors (Intel Xeon 3.2 GHz). The whole
computation was finished within approximately 60 hours.
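
The ligand definition above can be expressed as a simple filter. The following Python sketch is only an illustration: the Entity record, its field names, and the exact type strings are hypothetical simplifications of what a PDBML entity category contains, not the data structures actually used by GIRAF.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MAX_PEPTIDE_LIGAND_LENGTH = 24   # longer polypeptides are treated as proteins
MAX_LIGAND_ATOMS = 1000          # larger ligands are discarded


@dataclass
class Entity:
    """Hypothetical, simplified view of one entity in a PDB entry."""
    entity_type: str   # assumed labels, e.g. "polypeptide(L)", "water", "non-polymer"
    n_residues: int    # number of residues (1 for small molecules and ions)
    n_atoms: int       # number of atoms in the entity


def is_ligand(entity: Entity) -> bool:
    """Return True if the entity counts as a ligand under the stated rules."""
    if entity.entity_type == "water":
        return False
    if (entity.entity_type == "polypeptide(L)"
            and entity.n_residues > MAX_PEPTIDE_LIGAND_LENGTH):
        return False
    # Everything else (short peptides, DNA, RNA, sugars, lipids, ions, ...)
    # is a ligand unless it is too large.
    return entity.n_atoms <= MAX_LIGAND_ATOMS
```

Under these assumptions, a 24-residue peptide is kept as a ligand, whereas a 25-residue polypeptide, a water molecule, or a 1001-atom entity is not.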



Clusters of similar ligand binding sites 

To obtain complete-linkage clusters, we first constructed
a single-linkage network based on a pre-defined P-value
threshold. Then this network was decomposed into connected
components. Each component was then broken into
finer components by imposing a more stringent P-value
threshold. This decomposition was iterated until the P-value
threshold reached 10^-100. Then bottom-up complete-linkage
clustering was iteratively applied to each connected
component, the result of which was then combined into an upper
component (previously determined with a higher P-value
threshold). This bottom-up process was terminated when
the P-value threshold of 10^-15 was reached. Each cluster was
defined as a structural motif for the ligand binding sites.
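
The hybrid top-down/bottom-up procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the dictionary of pairwise P-values keyed by frozenset, the treatment of missing pairs as P = 1, the example threshold schedule, and the greedy merge order (closest pair of clusters first) are all choices made for the sketch.

```python
from itertools import combinations


def pvalue(pvals, a, b):
    """P-value of a site pair; pairs without a recorded hit count as P = 1."""
    return pvals.get(frozenset((a, b)), 1.0)


def connected_components(nodes, pvals, thr):
    """Single-linkage components: sites are linked when P < thr."""
    nodes = list(nodes)
    unseen = set(nodes)
    components = []
    for start in nodes:
        if start not in unseen:
            continue
        unseen.discard(start)
        comp, stack = [start], [start]
        while stack:
            u = stack.pop()
            for v in [v for v in unseen if pvalue(pvals, u, v) < thr]:
                unseen.discard(v)
                comp.append(v)
                stack.append(v)
        components.append(comp)
    return components


def complete_linkage(clusters, pvals, thr):
    """Merge clusters while every cross pair of some cluster pair has P < thr."""
    clusters = [set(c) for c in clusters]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            worst = max(pvalue(pvals, a, b)
                        for a in clusters[i] for b in clusters[j])
            if worst < thr and (best is None or worst < best[0]):
                best = (worst, i, j)
        if best is None:
            return clusters
        _, i, j = best            # i < j, so deleting j keeps i valid
        clusters[i] |= clusters[j]
        del clusters[j]


def hybrid_cluster(nodes, pvals, thresholds):
    """thresholds: non-empty list, loosest first, e.g. [1e-15, 1e-30, 1e-100]."""
    thr, stricter = thresholds[0], thresholds[1:]
    motifs = []
    for comp in connected_components(nodes, pvals, thr):
        if stricter:              # top-down: split with a more stringent threshold
            seeds = hybrid_cluster(comp, pvals, stricter)
        else:                     # most stringent level: start from single sites
            seeds = [{n} for n in comp]
        # bottom-up: combine the finer clusters within this upper component
        motifs.extend(complete_linkage(seeds, pvals, thr))
    return motifs
```

In this sketch, a site that is linked to only one member of a cluster stays in the same connected component but is not merged into the cluster, which is the behavior expected of complete linkage.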
Analysis of networks and structural motifs
To annotate the structural motifs thus obtained with the
SCOP (Murzin et al., 1995) codes, we used the parsable file
of SCOP (version 1.73). When an analysis involved SCOP
codes, those PDB entries whose SCOP classification had not
yet been determined were ignored. Each SCOP SCCS code
was assigned to a ligand binding site as described by others
(Gold & Jackson, 2006). When a site resides at an interface
of multiple domains, multiple SCCS codes were assigned
to the site. Two or more binding sites are said to share the
same fold (or family, superfamily, etc.) if the intersection of
their SCCS code sets is not empty. The SCCS code assigned
to a structural motif was defined as the union of all the
SCCS codes found in the corresponding cluster members.
We used only the seven main SCOP classes (all-α [a],
all-β [b], α/β [c], α+β [d], multi-domain [e], membrane and
cell surface proteins and peptides [f], and small proteins
[g]). The figures of alignments (Figs. 5 and 8) were created
with jV version 3 (Kinoshita & Nakamura, 2004) using the
PDBML-extatom files produced by GIRAF. The network
figures (Fig. 6B-F) were created with the Tulip software
(http://www.tulip-software.org/).
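
The rule for deciding whether binding sites share a fold (a non-empty intersection of their SCCS code sets) can be illustrated with a short Python sketch. The function names are hypothetical, and the SCCS strings in the usage note are illustrative values rather than assignments taken from this study.

```python
MAIN_CLASSES = set("abcdefg")   # the seven main SCOP classes used in the text
LEVEL_DEPTH = {"class": 1, "fold": 2, "superfamily": 3, "family": 4}


def truncate_sccs(sccs, level="fold"):
    """Cut an SCCS string such as 'c.37.1.8' down to the requested level."""
    return ".".join(sccs.split(".")[:LEVEL_DEPTH[level]])


def main_class_codes(codes):
    """Keep only codes belonging to the seven main SCOP classes."""
    return {c for c in codes if c[:1] in MAIN_CLASSES}


def share_level(codes_a, codes_b, level="fold"):
    """True if the two SCCS code sets intersect at the given level."""
    a = {truncate_sccs(c, level) for c in main_class_codes(codes_a)}
    b = {truncate_sccs(c, level) for c in main_class_codes(codes_b)}
    return bool(a & b)


def motif_sccs(member_code_sets):
    """SCCS codes of a motif: the union over all member binding sites."""
    return set().union(*member_code_sets)
```

For example, share_level({"c.37.1.8"}, {"c.91.1.1"}) is False, so such a pair would count as a cross-fold match, whereas a site at a domain interface carrying both "c.37.1.8" and "d.108.1.1" shares a fold with any site annotated "d.108.1.1".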